ABSTRACT

Classification evaluation metrics are often used to evalu- 
ate adaptive tutoring systems — programs that teach and 
adapt to humans. Unfortunately, it is not clear how
intuitive these metrics are for practitioners with little
machine learning background. Moreover, our experiments
suggest that the existing convention for evaluating tutoring
systems may lead to suboptimal decisions. We propose the
Learner Effort-Outcomes Paradigm (Leopard), a new framework to
evaluate adaptive tutoring. We introduce Teal and White, 
novel automatic metrics that apply Leopard and quantify 
the amount of effort required to achieve a learning outcome. 
Our experiments suggest that our metrics are a better alter- 
native for evaluating adaptive tutoring. 
1. INTRODUCTION

A fundamental part of the scientific and engineering pro- 
cess is testability — the property of evaluating whether a 
hypothesis or method can be supported or falsified by data 
of actual experience. For example, in educational data min- 
ing, we formulate testable hypotheses that claim that the 
methods we engineer improve the outcomes of learners. In 
this manuscript, we study how to verify learner outcome 
hypotheses. 

We focus on evaluating a popular type of educational method
called adaptive intelligent tutoring systems. Adaptive
systems teach and adapt to humans; their promise is to
improve education by optimizing the subset of items presented
to students, according to their historical performance
and to features extracted from their activities. In this
context, items are questions, problems, or tasks that can be 
graded individually. 

Evaluation metrics are important because they quantify the
extent to which an educational system helps learners. For
example, a practitioner may use an evaluation method to 
choose which of the alternative adaptive tutoring systems 
to deploy in a classroom, or school district. On the other 
hand, a researcher may be interested in quantifying the im- 
provements of her system compared to previous technology. 

Our main contributions are proposing a novel evaluation 
paradigm for assessing adaptive tutoring and examples of 
when traditional evaluation techniques are misleading. This 
paper is organized as follows: § 2 reviews related methods
for evaluating adaptive systems; § 3 describes the paradigm
we propose for automatic evaluation of tutoring systems;
§ 4 provides a meta-evaluation of our novel evaluation
techniques; and the final section provides some concluding
remarks.

2. BACKGROUND 

Adaptive tutoring is often implemented as a complex sys- 
tem with many components, such as a student model, con- 
tent pool, and a cognitive model. Adaptive tutoring may 
be evaluated with randomized control trials. For example, 
in a seminal study that focused on earlier adaptive tu- 
tors, a controlled trial measured the time students spent on 
tutoring and their performance on post-tests. The study re- 
ported that the tutoring system enabled significantly faster 
teaching, while students maintained the same or better
performance on post-tests.

Unfortunately, controlled trials can become extremely
expensive and time-consuming to conduct: they require
institutional review board approvals, experimental design by an
expert, recruiting (and often payment!) of enough partici- 
pants to achieve statistical power, and data analysis. Au- 
tomatic evaluation metrics improve the engineering process 
because they enable less expensive and faster comparisons 
between alternative systems. Fields that have agreed on 
automatic evaluation have seen an accelerated pace of tech- 
nological progress. For example, the widespread adoption 
of the Bleu metric in the machine translation commu- 
nity has lowered the cost of development and evaluation of 
translation systems. At the same time, it has enabled ma- 
chine translation competitions that result in great advances 
of translation quality. Similarly, the Rouge metric has 
helped the automatic summarization community transition
from expensive user studies of human judgments that may
take thousands of hours to conduct, to an automatic metric 
that can be computed very quickly. 

The adaptive tutoring community has tacitly adopted
conventions for evaluating tutoring systems. Researchers
often evaluate their models with classification evaluation
metrics that assess the student model component of
the tutoring system; student models are the subsystems
that forecast whether a learner will answer the next item
correctly. Popular classification evaluation metrics include
accuracy, log-likelihood, Area Under the Curve (AUC) of
the Receiver Operating Characteristic curve, and, strangely
for classifiers, the Root Mean Square Error. However,
automatic evaluation metrics are intended to measure an
outcome of the end user. For example, the PARADISE
metric used in spoken dialogue systems correlates with user
satisfaction scores. Not only is there no evidence that
classification metrics correlate with learning outcomes,
but prior work has also identified serious problems
with them. For example, classification metrics ignore that
an adaptive system may not help learners, which could
happen with a student model with a flat or decreasing
learning curve. A decreasing learning curve implies that
student performance decreases with practice; this curve is
usually interpreted as a modeling problem, because it
operationalizes that learners are better off with no teaching.
Therefore, an adaptive tutor with a student model with a
decreasing learning curve does not teach students.

Surprisingly, in spite of all of the evidence against using
classification evaluation metrics, their use is still very
widespread in the adaptive tutoring literature. Moreover,
there is very little research on alternative evaluation
techniques. A notable exception is recent work on
individualizing student models. The authors evaluated their
approach using a method called ExpOppNeed, which calculates
the expected number of practice opportunities that learners
require to master the content of the tutoring curriculum.
Though their evaluation methodology is extremely interesting
and promising, it was not intended to be generalizable. In
the next section we extend prior work and present a novel
general paradigm for evaluating adaptive systems.

3. LEOPARD EVALUATION 

Adaptive tutoring implies making a trade-off between
minimizing the amount of student effort, by carefully
personalizing the curriculum, and maximizing student outcomes.
For example, repeated practice on a skill may improve stu- 
dent proficiency, at the cost of a missed opportunity for 
teaching new material. Adequate values for student effort 
and outcomes respond to external expectations from the so- 
cial context. For example, it is not acceptable for a tutor 
to minimize effort by not teaching any content at all, or to 
maximize outcomes by taking twenty years to teach a simple
concept. The right trade-off is defined by subject matter
experts.

We propose the novel Learner Effort-Outcomes Paradigm
(Leopard) for automatic evaluation of adaptive tutoring. At
its core, Leopard quantifies the effort and outcomes of
students in adaptive tutoring. Even though measuring effort
and outcomes is not novel by itself, our contribution is
measuring both without a randomized control trial.

• Effort: Quantifies how much practice the adaptive tu- 
tor gives to students. In this paper we focus on count- 
ing the number of items assigned to students but, al- 
ternatively, amount of time could be considered. 

• Outcome: Quantifies the performance of students after 
adaptive tutoring. For simplicity, we operationalize 
performance as the percentage of items that students 
are able to solve after tutoring. We assume that the 
performance on solving items is aligned to the long- 
term interest of learners. 


We argue that Leopard is more intuitive than classification
metrics because effort and outcome resonate with
educational principles. We now describe two novel metrics that
apply the Leopard philosophy. In § 3.1 we describe Teal, a
metric that calculates the theoretical expected behavior of
students when interacting with a family of student models;
and in § 3.2 we describe White, a metric that uses empirical
data that may not have been collected in a controlled trial.


3.1 Theoretical Evaluation of Adaptive Learning Systems (Teal)

We formulate Theoretical Evaluation of Adaptive Learning 
Systems (Teal) to evaluate adaptive tutoring from the ex- 
pected behavior of their student model. Teal focuses on 
models of the Knowledge Tracing Family, a very popular
set of student models.


To use Teal on data collected from students, we first train a
model using an algorithm from the Knowledge Tracing family
(§ 3.1.1), then we use the learned parameters to calculate
the effort (§ 3.1.2) and outcome (§ 3.1.3) for each skill. We
discuss how to use Teal on models that use features (§ 3.1.4)
and our design decisions (§ 3.1.5).


3.1.1 Knowledge Tracing Family 

Figure 1: Knowledge Tracing plate diagram. The color of
the circles represents whether the variable is latent (white), or
observed in training (light), and plates represent repetition.

Figure 1 describes the Knowledge Tracing model, the
simplest member of the family. Knowledge Tracing requires
a mapping of items to skills, often built by subject
matter experts, although automatic approaches exist.
These skill mappings are also called cognitive models, or
Q-matrices. Knowledge Tracing uses a Hidden Markov Model
(HMM) per skill to model the student's knowledge as latent
variables. The binary observation variable $y_{u,q,t}$ represents
whether the student $u$ applies the $t$-th practice opportunity
of skill $q$ correctly. The latent variable $k_{u,q,t}$ models the
latent student proficiency, which is often modeled with a binary
variable to indicate mastery of the skill. To declutter
notation, we may not explicitly write the indices $q$ and $u$.
There are two conventions for naming the skill-specific
parameters of Knowledge Tracing. In the HMM tradition, the
parameters are simply named transition or learning ($L$), and
emission ($E$). In the educational tradition, when using two
latent states, the parameters are called initial knowledge
($L_0$), learning ($l$), forgetting ($f$), guess ($g$) and slip
($s$). The Knowledge Tracing family includes models that
parameterize the emission probabilities, transition
probabilities, or both. For example, in Knowledge Tracing, the
probability of emitting an answer $y$ when the student has
knowledge $k$ is:

$$E_{y,k} = p(y \mid k) \tag{1}$$

This is simply a binomial probability. To allow features
in the emissions, we replace the binomial with a logistic
regression [10]:

$$E_{y,k}(\beta, X_t) = p(y \mid k; \beta, X_t) \tag{2}$$

$$= \frac{1}{1 + \exp(-\beta^{\top} \cdot X_t)} \tag{3}$$

Here $X_t$ is the feature vector extracted at time $t$, and
$\beta$ is the regression coefficient vector. The feature may
indicate, for example, if the student requested a hint.

Footnote: Tradition names metrics after colors, e.g., Rouge and Bleu.

3.1.2 Effort

Teal calculates the expected number of practice opportunities
that an adaptive tutor gives to students. We assume a policy
in which the tutor stops teaching a skill once the student is
very likely to answer the next item correctly according to a
model from the Knowledge Tracing Family. For notational
convenience, we define the probability of answering the next
item correctly as:

$$c_{t+1}(y_1, \dots, y_t) = p(y_{t+1} = \text{correct} \mid y_1, \dots, y_t; L, E) \tag{4}$$

Here $L$ and $E$ are the parameters of the Knowledge Tracing
Family model. We can estimate $c_{t+1}$ using conventional
inference techniques for HMMs, such as the Forward-Backward
algorithm.
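
As an illustration of Equation 4, the following minimal sketch computes the next-item prediction for the two-state Knowledge Tracing model. It is not the authors' implementation: the function name and argument names are hypothetical, and it uses only the forward (filtering) recursion, which is sufficient here because the prediction conditions on past responses only.

```python
def predict_next_correct(responses, L0, learn, guess, slip, forget=0.0):
    """Return c_{t+1} = p(y_{t+1} = correct | y_1..y_t) for two-state Knowledge Tracing.

    responses: list of observed answers coded 1 (correct) or 0 (incorrect);
               an empty list gives the prediction before any evidence.
    Assumes every observed answer has non-zero probability under the model.
    """
    p_known = L0  # probability that the skill is known before the first item
    for y in responses:
        # Condition on the observed answer using the guess/slip emissions.
        like_known = (1.0 - slip) if y == 1 else slip
        like_unknown = guess if y == 1 else (1.0 - guess)
        evidence = p_known * like_known + (1.0 - p_known) * like_unknown
        posterior = p_known * like_known / evidence
        # Apply the learning/forgetting transition to reach the next opportunity.
        p_known = posterior * (1.0 - forget) + (1.0 - posterior) * learn
    # Probability of a correct answer on the next item.
    return p_known * (1.0 - slip) + (1.0 - p_known) * guess
```

For instance, `predict_next_correct([1], 0.3, 0.25, 0.3, 0.3)` gives the prediction for the second item after one correct answer.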

The adaptive tutor teaches an additional item if two
conditions hold: (i) it is likely that the student will get the
next item wrong, that is, the probability of answering the
next item correctly is below a threshold $R$; and (ii) the
tutor has not already decided to stop instruction. More
formally, the tutor keeps teaching if:

$$\text{teach}(y_1, \dots, y_t, R) = \begin{cases} 1 & \text{if } \bigwedge_{t' \le t} c_{t'+1}(y_1, \dots, y_{t'}) < R \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

We can now calculate at which practice opportunity the
tutor should stop instruction. For simplicity, we assume all
sequences are of length $T$. We simply count all of the times
the tutor decides to teach a new item:

$$\text{cost}_R(y_1, \dots, y_T) = \sum_{t=1}^{T} \text{teach}(y_1, \dots, y_t, R) \tag{6}$$

Note that if the probability of answering the next item
correctly has not reached the threshold in $T$ time steps, the
cost is defined as $T$. Teal defines effort as the expected
value of the number of practice opportunities a tutor gives.
This is:

$$\text{effort}(R) = \mathbb{E}\left(\text{cost}_R(\mathcal{Y}_T)\right) \tag{7}$$

$$= \sum_{y_1, \dots, y_T \in \mathcal{Y}_T} \underbrace{\text{cost}_R(y_1, \dots, y_T)}_{\text{amount of practice}} \cdot \underbrace{p(y_1, \dots, y_T)}_{\text{sequence likelihood}} \tag{8}$$

Here, $\mathcal{Y}_T$ is the set of all sequences of length $T$.
When we have binary student outcomes (correct or not), the
cardinality of this set is $2^T$, which makes Teal only
tractable for sequences of a few dozen observations. In our
experience, the sequences of adaptive tutoring systems are
often in this range. In a companion paper we give an
alternative formulation of Teal that allows approximate
calculations. The likelihood of the sequence can be efficiently
estimated using the Forward-Backward algorithm.


3.1.3 Outcome 

We define the outcome of a student as the mean performance
after the tutor should stop instruction. For a particular
sequence with student cost $k = \text{cost}_R(y_1, \dots, y_T)$,
this is:

$$\text{outcome}(y_1, \dots, y_T, k) = \begin{cases} \text{mean}(y_k, \dots, y_T) & \text{if } k < T \\ \text{imputed value} & \text{otherwise} \end{cases} \tag{9}$$

We map the correct and incorrect student responses $y_t$ into 1
or 0, respectively. If the student sequence does not reach the
performance threshold, we impute the value of the outcome.
In this paper, we set the imputation value to 0. We define
the score as the expected value of the outcome:

$$\text{score}(R) = \mathbb{E}\left(\text{outcome}(\mathcal{Y}_T, k)\right) \tag{10}$$

$$= \sum_{y_1, \dots, y_T \in \mathcal{Y}_T} \text{outcome}(y_1, \dots, y_T, k) \cdot p(y_1, \dots, y_T) \tag{11}$$
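
The expectations that define Teal's effort and score can be sketched by brute-force enumeration, as below. This is a hedged illustration, not the authors' code. The function `teal` and its argument names are hypothetical. The stopping rule follows one reading of the teaching condition (the tutor stops the first time the next-item prediction reaches $R$). The outcome averages the responses that remain after the first $k$ items, which is an assumption about the indexing in the outcome definition. Enumerating all $2^T$ sequences is only practical for small $T$.

```python
from itertools import product


def teal(L0, learn, guess, slip, R, T, forget=0.0, impute=0.0):
    """Return (effort, score) by enumerating all 2**T response sequences."""
    effort = 0.0
    score = 0.0
    for seq in product((0, 1), repeat=T):
        p_known = L0        # p(skill known) before the first item
        likelihood = 1.0    # p(y_1, ..., y_T), accumulated by the forward pass
        cost = 0            # number of times the tutor decides to keep teaching
        stopped = False
        for y in seq:
            like_known = (1.0 - slip) if y == 1 else slip
            like_unknown = guess if y == 1 else (1.0 - guess)
            evidence = p_known * like_known + (1.0 - p_known) * like_unknown
            if evidence == 0.0:   # impossible sequence under these parameters
                likelihood = 0.0
                break
            likelihood *= evidence
            posterior = p_known * like_known / evidence
            p_known = posterior * (1.0 - forget) + (1.0 - posterior) * learn
            c_next = p_known * (1.0 - slip) + (1.0 - p_known) * guess
            if c_next >= R:
                stopped = True    # once stopped, the tutor never resumes
            if not stopped:
                cost += 1
        if likelihood == 0.0:
            continue
        # Mean of the responses left after the first `cost` items (assumed indexing).
        outcome = sum(seq[cost:]) / (T - cost) if cost < T else impute
        effort += cost * likelihood
        score += outcome * likelihood
    return effort, score
```

With a threshold that can never be reached (for example `R=1.1`), every sequence costs `T` and receives the imputed outcome, which is a convenient sanity check.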


3.1.4 Usage on Models With Features 

For models that parameterize emission or transitions we first 
must build a counterfactual feature vector X, and use it to 
calculate model parameters that do not depend on features. 
For example, consider a model that uses a binary feature 
vector that encodes students in different conditions. Condi- 
tions can be any feature of interest of the tutoring system, 
such as the ability to display multimedia content. We can 
use Teal to calculate the effort of students in each of the 
specific conditions. 

For example, consider a feature vector $X = (f_1, f_2, \dots, f_n)$.
Feature $f_1$ is 1 iff the student is using condition 1 (e.g.,
multimedia content is available), feature $f_2$ is 1 iff the
student is using condition 2, etc. The vector is all zeros if
the student is in the control condition. If we activate feature
$f_1$, we can calculate the effort or score of students in
treatment 1. To apply Teal we first estimate counterfactual
slip and guess parameters using Equation 3. We can then use
the counterfactual parameters with Teal.

For some models with features, Teal may require that
students are assigned randomly to feature activation
conditions, so that the regression coefficients can be
interpreted as causal effects. Teal may not be appropriate if,
for example, the features have reverse causality, or if there
are omitted variables in the model.
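
The counterfactual step can be sketched as follows. This is an illustration under stated assumptions rather than the paper's procedure: the function names are hypothetical, the coefficients are assumed to be state-specific (one vector for the "known" state and one for the "not known" state), and the last entry of each coefficient vector is assumed to be an intercept.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def counterfactual_guess_slip(beta_unknown, beta_known, x):
    """Turn logistic emission coefficients into feature-free guess and slip values.

    beta_unknown, beta_known: hypothetical per-state coefficient lists with one
        weight per feature followed by an intercept (an assumption of this sketch).
    x: counterfactual binary feature vector, e.g. [1, 0] activates condition 1
       and [0, 0] is the control condition.
    """
    def p_correct(beta):
        # zip stops at len(x), so the trailing intercept is added separately.
        z = beta[-1] + sum(b * xi for b, xi in zip(beta, x))
        return sigmoid(z)

    guess = p_correct(beta_unknown)       # p(correct | skill not known)
    slip = 1.0 - p_correct(beta_known)    # p(incorrect | skill known)
    return guess, slip
```

The resulting guess and slip values no longer depend on features, so they can be plugged into a plain Knowledge Tracing computation of effort and score for that condition.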

3.1.5 Design Discussion 



Figure 2: Expected and empirical student performance for
a skill ($L_0 = 0.3$, $l = 0.25$, $g = 0.3$, $s = 0.3$, $f = 0$).

Teal extends the ExpOppNeed algorithm discussed in § 2.
We compare both approaches to justify our design decisions.

1. When to stop tutoring. Teal expects tutoring to
stop once the student is very likely to apply the skill
correctly. On the other hand, ExpOppNeed relies on
stopping tutoring once the posterior probability of the
latent variable for knowledge is above a threshold.
Figure 2 compares both approaches for some Knowledge
Tracing parameters. The solid lines represent the
expected values derived theoretically (see footnote) for
both strategies. To illustrate what actual student behavior
may look like, we plotted dotted lines for 50 synthetic
students sampled from an HMM. Although individual students
vary, their average behavior is close to the theoretical one.

In the figure, with 15 practice opportunities the stu- 
dents have close to 100% probability of skill mastery, 
while they only have 65% probability of applying the 
skill correctly. This big gap between the probability 
of mastery and probability of correct (the two solid 
lines) implies that the model is defining mastery as a 
state when students have low probability of applying 
the skill correctly. Low probability of answering cor- 
rectly in a mastery state can occur due to a number of 
problems, for example, an incorrect item-to-skill mapping,
or confusing tutoring content. We argue that an
evaluation metric should penalize such models to be
consistent with the Mastery Learning Theory.

Moreover, prior work has demonstrated that some 
ill-defined models have probability of correct decreas- 
ing with practice opportunities, at the same time that 
the probability of mastery increases. ExpOppNeed 
does not penalize such ill-defined models, but Teal 
does. 

Footnote: Prior work derived $p(y_t = \text{correct}) = 1 - s - A\beta^{t}$.
Here, $\beta = (1 - l)$ and $A = (1 - s - g) \cdot (1 - L_0)$.

Algorithm 1 Single-Skill White

Require: performance sequences $y_{u,q,t}$, student model
    predictions $c_{u,q,t}$ (the subscripts index students, skills,
    and practice opportunities), threshold $R$
function WHITE($y_{u,q,t}$, $c_{u,q,t}$, $R$)
    for each student $u$ do
        for each skill $q$ do
            ▷ Select data for student $u$ and skill $q$ only:
            $y', c' \leftarrow \text{filter}(y, c, u, q)$
            $\text{effort}(q, u) \leftarrow 0$
            for each practice opportunity $t$ in $y'$ do
                if $c'_{t+1} \geq R$ then
                    $\text{score}(q, u) \leftarrow \text{mean}(y'_{t+1}, \dots, y'_T)$
                    next skill $q$
                else if last($t$) then
                    $\text{score}(q, u) \leftarrow \text{impute}$
                $\text{effort}(q, u) \leftarrow \text{effort}(q, u) + 1$
    return effort, score


2. What to measure. ExpOppNeed does not calculate 
expected outcome of students. Teal considers both stu- 
dent outcome and effort because it is trivial to optimize 
one of the metrics if the other one is ignored. 

3. Precision of the results. Both ExpOppNeed and
Teal have exponential computational complexity. However,
ExpOppNeed uses a heuristic to prune sequences
with low probability. Unfortunately, if the effort is
very high (or infinite), the likelihood of the individual
sequences becomes very low, so ExpOppNeed prunes
the sequences too soon and may therefore underestimate
the effort. Teal improves on ExpOppNeed by
defining effort on fixed-length sequences and not
pruning.
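
The gap between mastery and correctness discussed in the first design point can be checked numerically. The sketch below is illustrative only: it assumes Knowledge Tracing without forgetting, uses the closed form for the probability of a correct answer given in the footnote, and assumes the exponent counts completed practice opportunities (the paper's indexing may differ by one). The function name is hypothetical.

```python
def mastery_and_correct(t, L0, learn, guess, slip):
    """Closed-form expected curves for Knowledge Tracing without forgetting.

    t: number of completed practice opportunities (t = 0 is before any practice).
    Returns (p_mastery, p_correct).
    """
    p_not_known = (1.0 - L0) * (1.0 - learn) ** t
    p_mastery = 1.0 - p_not_known
    # Equivalent to 1 - s - A * beta**t with A = (1 - s - g)(1 - L0), beta = 1 - l.
    p_correct = 1.0 - slip - (1.0 - slip - guess) * p_not_known
    return p_mastery, p_correct


# Parameters from the Figure 2 caption.
for t in (0, 5, 15):
    mastery, correct = mastery_and_correct(t, L0=0.3, learn=0.25, guess=0.3, slip=0.3)
    print(t, round(mastery, 3), round(correct, 3))
```

Under these assumptions the probability of a correct answer can never exceed $1 - s$, no matter how close the probability of mastery gets to 1, which is why a stopping rule based on mastery alone can hide a poorly performing skill.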

We now summarize some limitations of our approach. Teal
assumes that the model parameters are correct, and does
not take into account potential modeling problems, such
as misspecification or over-fitting. By design, Teal is only
able to evaluate models in the Knowledge Tracing Family.
We now present a novel evaluation method that addresses
these limitations.

3.2 Whole Intelligent Tutoring System Empirical Evaluation (White)

We propose Whole Intelligent Tutoring System Evaluation
(White), a novel automatic method that evaluates the rec- 
ommendations of an adaptive system using data. White 
does not assume the student data is generated by a Knowl- 
edge Tracing model; instead, it relies on counterfactual sim- 
ulations. White reproduces the decisions that the tutoring 
system would have made given the input data on the test 
set, by counting how many items the adaptive tutor would 
ask students to solve, and what is the mean student perfor- 
mance after tutoring. 
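
The counterfactual replay that White performs can be sketched in a few lines. This is a hedged reading of Algorithm 1, not the authors' code: function names and the input layout are hypothetical, `c[i]` is assumed to be the model's prediction for response `i` computed from the history before it, and a sequence that never reaches the threshold is imputed with the student's average performance. The indexing is chosen so that a student whose first prediction already meets the threshold gets an effort of 0, as in the Alice example of Figure 3.

```python
def white_single(y, c, R, impute=None):
    """Replay the tutor's stopping rule on one student-skill sequence.

    y: observed responses coded 1/0, in practice order (assumed non-empty).
    c: predicted probability of a correct answer for each response in y.
    Returns (effort, score).
    """
    effort = 0
    for i, prediction in enumerate(c):
        if prediction >= R:
            # The tutor would stop here: score the responses from this point on.
            remaining = y[i:]
            return effort, sum(remaining) / len(remaining)
        effort += 1
    # Threshold never reached: fall back to an imputed score.
    if impute is None:
        impute = sum(y) / len(y)   # the student's average performance
    return effort, impute


def white(data, R):
    """data maps (skill, student) -> (y, c); returns per-pair effort and score dicts."""
    effort, score = {}, {}
    for key, (y, c) in data.items():
        effort[key], score[key] = white_single(y, c, R)
    return effort, score
```

Because the function only needs a sequence of predictions, the same replay works for any student model that can output a probability for the next response.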

Algorithm 1 describes White for a tutoring system that
assumes an item is assigned to exactly one skill. We leave more
complex tutors for future work. The input of White is the
student performance sequences $y$, the predictions of
answering correctly $c$, and a threshold $R$ that defines the
target probability of correct. White assumes that the
students are a random sample of the student population. The
predictions are calculated by the student model component
of the adaptive tutoring system. For a data-driven student
model, the predictions can be informed with the history
preceding the current time step. For instance, to predict on the
third time step, the student model may use the data up to the
second time step. For example, for Knowledge Tracing:

$$c_t = p(y_t = \text{correct} \mid y_1, \dots, y_{t-1}) \tag{12}$$

Figure 3: Example of White calculating counterfactual score
and effort using empirical data ($R = 0.6$).

Figure 3 shows example data of how White works for a 60%
threshold ($R = 0.6$). For each student and skill in the test
set, White estimates their counterfactual effort: how many
items the student would have solved using the tutoring
system. In our example, Alice does not get to practice the
skill because the student model believes that she is likely to
already know it (effort = 0), but Bob is given one practice
opportunity (effort = 1). After Bob answers the item correctly,
he is not given any more practice. White also calculates a
counterfactual score to represent the student learning. It is
the percentage of correct answers after the instruction would
have stopped. The score is related to an existing
classification evaluation metric called precision. Precision
aggregates the entire dataset, while score is computed by
students and skills. Although superficially it may sound like
a small difference, our strategy allows us to avoid a special
case of Simpson's Paradox. In § 4.1.1 we discuss the issue more.

In this paper, when we report results with White, we impute
the score of students that do not reach the threshold with
their average performance. This is deliberately a different
imputation strategy from the one we use with Teal, which
assigns a score of zero to students that do not reach the
threshold.

4. META-EVALUATION

In this section we meta-evaluate Leopard. We experiment
with data from students (§ 4.1) and simulations (§ 4.2).

We compare these sets of metrics:

• Conventional metrics. We use classification
evaluation metrics to evaluate how the student models
predict future student performance. For this, we allow
student models to use the history preceding the time
step we want to predict.

• Leopard metrics. We use the score and effort as
calculated by White and Teal. For simplicity we report
the average scores across skills, and the sum of the
mean effort. For $U$ students and $Q$ skills, this is:

$$\text{dataset score}(R) = \frac{1}{Q} \sum_{q}^{Q} \frac{1}{U} \sum_{u}^{U} \text{score}(q, u) \tag{13}$$

$$\text{dataset effort}(R) = \sum_{q}^{Q} \frac{1}{U} \sum_{u}^{U} \text{effort}(q, u) \tag{14}$$

4.1 Real Student Data

We use data collected from a commercial non-adaptive
tutoring system for middle school Math. Our dataset includes
only the first part of the entire curriculum, and contains
students from the same grade from multiple schools. It contains
approximately 1.2 million observations from 25,000 students.
We randomly split the dataset into three sets of students.
The training and test set have 60% and 20% of the students,
respectively. The remainder of the data is reserved for future
experiments not described in this paper. The item bank was
mapped to skills in three different ways: the coarse
definition maps the items into 27 skills, the fine definition
into 90 skills, and the proprietary one is not reported.


4.1.1 Are predictive models always useful? 

Assessing an evaluation metric with real student data is dif- 
ficult because we often do not know the ground truth. To 
get around this, we now describe a strategy to select a sub- 
set of the dataset that we know the behavior of. Our main 
insight is that for adaptive tutoring to be able to optimize 
when to stop instruction, the student performance should in- 
crease with repeated practice (the learning curve should be 
increasing). Our strategy consists of selecting the subset of
the data where student modeling may fail, because student 
performance remains flat or decreases with practice. 


We first train a simplified Performance Factors Analysis
(PFA) model. We use a logistic regression for each skill:

$$p(y_{u,q,t} = \text{correct}) = \frac{1}{1 + \exp\left(-(\beta^{q} \cdot X^{q})\right)} \tag{15}$$

The dimensions of $X^{q}$ are the count of prior correct
responses of the student and an intercept. We learn the
parameters of the model $\beta^{q}$ using constrained
optimization: the regression coefficient for the effect of
prior correct responses has to be non-negative.
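
A minimal sketch of such a constrained fit is shown below. The paper does not say which optimizer was used; projected gradient ascent is substituted here purely for illustration, and the function name, step count and learning rate are hypothetical choices.

```python
import math


def fit_constrained_pfa(prior_correct, outcomes, steps=5000, lr=0.05):
    """Fit p(correct) = sigmoid(b0 + b1 * prior_correct) subject to b1 >= 0.

    prior_correct: count of prior correct responses for each observation.
    outcomes: observed responses coded 1/0.
    Uses projected gradient ascent on the mean log-likelihood: after every
    step the slope b1 is clipped to zero if it went negative.
    """
    b0, b1 = 0.0, 0.0
    n = len(outcomes)
    for _ in range(steps):
        grad0 = grad1 = 0.0
        for x, y in zip(prior_correct, outcomes):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            grad0 += y - p
            grad1 += (y - p) * x
        b0 += lr * grad0 / n
        b1 = max(0.0, b1 + lr * grad1 / n)   # projection onto b1 >= 0
    return b0, b1
```

A skill whose fitted slope ends at exactly zero is one where the data do not support improvement with practice, which is the kind of skill selected in this experiment.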


We only use data from the skills that have zero regression
coefficient for the effect of prior correct responses (flat or
decreasing learning curve). Such skills are not suitable for
an adaptive tutor because the PFA student model believes
that practice does not influence student performance. More
concretely, this PFA model would give infinite practice to
difficult skills, or no practice to easy skills. Table 1 compares
the results of using White and two conventional metrics on
the test set of the selected skills. We compare with a
majority class model that always predicts students' answers as
correct. The conventional metrics we report are the AUC,
because of its popularity, and the F-metric, because in
experiments we report later it correlates highly with White. For
White we use a threshold of 60%. We cannot report on Teal
because PFA is not part of the Knowledge Tracing Family.


Table 1: Evaluation metric comparison.

                                White             conventional
                                score   effort    F      AUC
Performance Factors Analysis    .18     10.1      .79    .85
Majority Class                  .18     11.2      0      .50


The AUC and F-metric results are arguably very high, in- 
dicating that the PFA model is highly predictive — yet by 
construction, we know that the model is not useful for adap- 
tivity. The high prediction power of PFA is explained only 
by the intercepts of the model. That is, the predictions are 
based on the skill difficulty, independently of the student 
performance. We argue that White communicates better 
the unfavorable nature of the model because it reports a 
very low score, and only a small improvement of effort when 
compared to a baseline. 

The problem with metrics that aggregate over the entire 
dataset, like the AUC and the F-metric, can be explained 
by Simpson’s paradox — a trend that appears in different 
groups of data that disappears or reverses when the groups 
are combined. Because adaptive tutors learn a model from 
each skill independently, it is effectively a group of models. 
White and Teal evaluate each skill independently and are 
not susceptible to this problem. Consider the alternatives: 

• Reporting as a baseline the difficulty classifier — a clas- 
sifier that only considers the fraction of correct answers 
of each skill in the training set. For example, in Table 1
the PFA model has an AUC of 0.8, the same as
the difficulty classifier. Because PFA did not outper- 
form this baseline, it suggests the student model has 
a problem. However, simulations provide evidence 
that useful student models may have predictive per- 
formance similar to the difficulty classifier. Therefore, 
the difficulty classifier baseline may reject some useful 
student models. Moreover, convention expects classi- 
fiers to have an AUC of higher than 0.5 to be useful, 
and this new baseline would break this interpretation. 

• Calculating classification metrics over skills indepen- 
dently. This would only be useful when the skills are 
known beforehand, and not discovered with data.
We now provide evidence that suggests that classification
metrics may be misleading, even when they are
not affected by Simpson's paradox.
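
The aggregation effect discussed above can be reproduced with a toy example. The data below are invented for illustration only: a "difficulty-only" model predicts the same value for every answer within a skill, so it cannot rank answers inside a skill, yet pooling the skills produces a high AUC.

```python
def auc(labels, scores):
    """Rank-based AUC: chance that a random positive outranks a random negative.

    Ties between a positive and a negative count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented data: an easy skill (80% correct) and a hard skill (20% correct).
easy_labels = [1, 1, 1, 1, 0]
hard_labels = [1, 0, 0, 0, 0]
# The model only knows skill difficulty, so predictions are constant per skill.
easy_scores = [0.8] * 5
hard_scores = [0.2] * 5

per_skill = [auc(easy_labels, easy_scores), auc(hard_labels, hard_scores)]
pooled = auc(easy_labels + hard_labels, easy_scores + hard_scores)
print(per_skill, pooled)
```

Within each skill the model is no better than chance, while the pooled figure looks strong only because it separates easy skills from hard ones.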

4.1.2 Do traditional metrics lead to good decisions? 
We now compare Leopard and traditional metrics for choos- 
ing an item-to-skill mapping. We train a PFA model using 
our Math dataset. Table 2 compares the results of White
($R = 0.6$) and AUC.

Table 2: Comparisons of item-to-skill definitions.

            White
            score   effort    AUC
coarse      .41     55.7      .69
fine        .36     88.1      .74

If we were to choose the best skill mapping by AUC alone, we
would choose the finer item-to-skill mapping, while White
selects the coarser one. Why do they disagree? The fine
skill mapping has more than three times as many skills (90
skills) as the coarse mapping (27 skills). This means that
for the effort to be the same on both models, the finer model
should give about a third of the practice per skill that the
coarser model gives. Even though the finer model is slightly
more predictive, we argue that the coarser model is better
suited for adaptive tutoring.
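
As a rough check of this argument, and assuming the reported effort is the sum over skills of the mean effort per skill, the Table 2 values can be converted to average practice per skill:

```latex
\frac{55.7}{27} \approx 2.06 \ \text{items per skill (coarse)}, \qquad
\frac{88.1}{90} \approx 0.98 \ \text{items per skill (fine)}, \qquad
\frac{55.7}{90} \approx 0.62 \ \text{items per skill (fine, for equal total effort)}
```

Under that assumption the finer model assigns about half as much practice per skill as the coarser one, which is not enough of a reduction to offset having 90 skills instead of 27.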

4.1.3 Case Study 

For completeness, Table 3 demonstrates using different
student modeling techniques with the coarse item-to-skill
mapping. For Knowledge Tracing, we show both the White
estimates, and the Teal estimates (in parentheses). We use
the average sequence length for each skill because Teal
requires a sequence length as an input. The estimates of
Teal and White for effort are very similar, but their scores
mismatch, possibly due to the differences in imputation
for skills that don't reach the threshold. The low score
metrics are indicative of students not reaching the performance
threshold. This suggests that further inspection is necessary,
because the learning curves may be decreasing or some
skills may have high slip probabilities. One of the advantages
of White is that it can be used to evaluate non-probabilistic
student models. For example, we use White to evaluate the
student model that gives practice of a skill until the student
gets three correct answers in the skill.

Table 3: Student model comparison using Leopard

                                Leopard
                                score        effort         AUC
Knowledge Tracing               .39 (.18)    49.5 (50.9)    .70
Performance Factors Analysis    .41          55.7           .69
Three Correct                   .39          59.1           n/a
Majority Class                  .41          65.6           .50


4.2 Simulations 

With real data, we do not know the extent to which the
parameters are learned correctly, or are affected by modeling
problems such as misspecification. We now use synthetic data to
evaluate different metrics and compare them to a ground truth.
Given that we know the Knowledge Tracing parameters that 
were used to generate the synthetic datasets, we can use Teal 
to calculate exactly the student effort and outcomes. 

We sample 500 different datasets using random Knowledge
Tracing parameters. We do not allow forgetting in any of the
datasets, but we do not impose any other constraint (not even
that students improve with practice). Each dataset has only
a single skill, and has 200 students with 10 practice
opportunities. We do not learn parameters from the synthetic
dataset, so we do not cross-validate.
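
A minimal sketch of how such a synthetic dataset could be generated is shown below. It is not the authors' generator: the function names are hypothetical, and drawing each parameter uniformly from [0, 1] is an assumed sampling scheme, since the text only says the parameters are random.

```python
import random


def sample_dataset(L0, learn, guess, slip, n_students=200, n_items=10, seed=0):
    """Sample 1/0 response sequences from two-state Knowledge Tracing without forgetting."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_students):
        known = rng.random() < L0            # initial knowledge state
        seq = []
        for _ in range(n_items):
            p_correct = (1.0 - slip) if known else guess
            seq.append(1 if rng.random() < p_correct else 0)
            if not known and rng.random() < learn:
                known = True                 # learning transition; never forgotten
        data.append(seq)
    return data


def random_parameters(rng):
    """Draw each parameter uniformly from [0, 1] (an assumed scheme)."""
    return {name: rng.random() for name in ("L0", "learn", "guess", "slip")}


# One synthetic dataset with randomly drawn parameters.
params = random_parameters(random.Random(1))
dataset = sample_dataset(seed=1, **params)
```

Because the generating parameters are known, the ground-truth effort and score for each dataset can be computed analytically and compared with what each metric reports.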


4.2.1 Which metrics correlate best with the truth? 

Figure 4: Correlation matrix of Leopard and conventional
metrics. The size of the circles indicates the magnitude of
the Pearson $\rho$ correlation coefficient.

Figure 4 shows the pairwise Pearson $\rho$ correlations across
500 synthetic datasets on Teal (score), Teal (effort), White
(effort), White (score), F-metric, log-likelihood, RMSE, AUC,
and Accuracy.

The metrics that correlate the most with the ground truth 
are White and the F-metric. Interestingly, the ground truth 
effort and score have low correlation with all the conven- 
tional metrics, except the F-metric, but the conventional 
metrics have relatively high correlation with each other
(except the F-metric). In other words, most conventional
metrics seem to be interchangeable.
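
A pairwise correlation matrix of this kind can be computed as in the
sketch below. The function names and the dictionary layout (one list of
per-dataset values per metric) are hypothetical, chosen only for
illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(metrics):
    """metrics maps a metric name to its values, one value per dataset."""
    names = list(metrics)
    return {a: {b: pearson(metrics[a], metrics[b]) for b in names}
            for a in names}
```

Here each list would hold 500 values, one per synthetic dataset, so
that entry `[a][b]` of the result measures how closely metric `a`
tracks metric `b` across datasets.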

We now investigate the effect of the imputation strategy of
White. We are mindful that all of the synthetic students
have 10 practice opportunities. Therefore, if White reports
an effort of 10 for a dataset, it is likely that the dataset is
not suitable for adaptivity, and that White may be imputing
missing data to calculate the score. Figure 5 compares the
324 datasets for which White reports an effort lower than 9.99.
Each dot in the scatterplot represents a different dataset. We
see that effort computed with White has an almost perfect
correlation with the ground truth (ρ = 1.00, p < 0.05). On the
other hand, the score computed with White is affected by
our imputation strategy, but still has near perfect correlation
(ρ = 0.98, p < 0.05) with the ground truth. The correlation
of the F-metric with the ground truth effort (ρ = -0.47) and
score (ρ = 0.89) is lower than White's. For example, when
the ground truth effort is 0, the F-metric ranges from very
bad (0.2) to very good (1.0) predictive power, but White's
effort is close to 0. Moreover, we speculate that score and
effort may be more relatable than the F-metric to practitioners
with little background in machine learning.

4.2.2 Does White Converge to True Values? 

We now investigate whether White converges to the true
values calculated by Teal. We use the same parameters used to
plot Figure and we manipulate the number of synthetic
students, each student with 20 practice opportunities.
Figure 6 shows that White converges to the true value
computed by Teal even with little data. Future work may
provide a formal argument of when White converges and how
much data it requires to do so.

Figure 5: Comparison between F-metric and White to the
ground truth.

Figure 6: Example of White converging to Teal.

5. DISCUSSION 

Our main contribution is the Leopard framework, which
automatically assesses adaptive tutoring systems in dimensions
that relate to learner effort and outcomes. These dimensions
were previously measured only in randomized control
trials. We present Teal and White, two novel metrics that
apply Leopard and are useful to evaluate adaptive tutoring
systems. Secondary contributions include a novel methodology
to assess evaluation metrics, the insight that Simpson's
paradox affects adaptive tutoring evaluation, and the
implementation of the techniques we propose in this paper.^

^ http://josepablogonzalez.com

Classification evaluation metrics are widespread in many
disciplines, and their use in education is very important.
For example, for Computer-Adaptive Testing (CAT),
classification metrics provide very useful insights into
psychometric models. Leopard is not intended to replace
classification metrics, randomized control trials, automatic
experimentation [^], or visualization approaches [^]. Leopard is
a complementary approach to existing techniques, and we
claim that it is especially useful when in vivo and online
experimentation is not feasible.

We argue against the de facto standard of evaluating adap- 
tive tutoring solely on classification metrics. Our experi- 
ments on real and synthetic data reveal that it is possible 
to have student models that are very predictive (as mea- 
sured by traditional classification metrics), yet provide little 
to no value to the learner. Moreover, when we compare 
alternative tutoring systems with classification metrics, we 
discover that they may favor tutoring systems that require 
higher student effort with no evidence that students learn 
more. That is, when comparing two alternative systems, 
classification metrics may prefer a suboptimal system. 

An interesting future direction may be to relax Teal's
assumption that all sequences have a fixed length. Future work
may provide more rigorous theoretical analysis of convergence
and confidence intervals, validate our metrics with
randomized control trials, or derive White for policies with
multiple skills per item.

We are excited to see future work in adaptive tutoring sys- 
tems reporting their contributions in terms of learner effort 
and outcomes. Besides the technical contributions of our 
evaluation metrics, we hope that our work contributes to 
the mission of driving the student modeling community to 
have a more learner-centric perspective. 